Tokenizing an Arabic Script Language
Abstract
In any natural language processing (NLP) project, the input text must be tokenized before morphological analysis or parsing. For Arabic script languages, tokenization faces more problems and plays a more crucial role in NLP systems. In this work we elaborate on some of these problems and present solutions to them. The research is based on a project for tokenizing and parsing Persian, Arabic and Kurdish texts.
Similar resources
Identification of Arabic word from bilingual text using character features
The identification of the language of the script is an important stage in the process of recognition of the writing. There are several works in this research area, which treat various languages. Most of the used methods are global or statistical. In this present paper, we study the possibility of using the features of scripts to identify the language. The identification of the language of the s...
On Cross-Script Information Retrieval
We address the problem of cross-script retrieval in the context of a microblog system such as Twitter. Specifically, we explore methods for using native Arabic script queries to retrieve Arabic tweets written in a Roman script known as Arabizi. For example, a query for “بباتك” would not match “kitab” even though an Arabic reader would see them as the same word. Moreover, because of the lack of ...
A Study of Sindhi Related and Arabic Script Adapted languages Recognition
1. INTRODUCTION The character recognition of the Roman type of languages especially English has come near to perfection and it is also considered as one of the successful application in the field of computer vision. The work on Arabic script and other scripts is being continued on; but the languages adopting Arabic script is very little while the work on Sindhi language is near to its origin. T...
Sangam: A Perso-Arabic to Indic Script Machine Transliteration Model
Indian sub-continent is one of those unique parts of the world where single languages are written in different scripts. This is the case for example with Punjabi, written in Indian East Punjab in Gurmukhi script (a Left to Right script based on Devnagri) and in Pakistani West Punjab, it is written in Shahmukhi (a Right to Left script based on Perso-Arabic). This is also the case with other lang...
Arabic Script Web Document Language Identifications Using Neural Network
This paper presents experiments in identifying language of Arabic script web documents using neural network. There are some difficulties when identifying those languages in Arabic script such as Persian, Turkish, Urdu, Jawi etc. Since there is a vast amount of information presented to the internet users, it is crucial to find an appropriate method in language identification for a variety of tex...